Unsupervised Clustering of Text Entities in Heterogeneous Grey Level Documents
Abstract
This paper presents a new method for the functional classification of text blocks in a document. It is based on texture analysis and unsupervised classification. Texture is used to define different classes of text blocks and to suggest an order of exploration, from the most eye-catching block to the least significant one. The typographical properties of a block are characterized by two main discriminating primitives: the complexity of the text drawing and the structural relief of the block. This analysis is the starting point for a three-class categorization into functional families (main headings, sub-headings, and text paragraphs). Each text block is described and classified through a labeling process in a 3D feature space built from these two features (complexity and structural relief) and a third one chosen among pattern primitives, block size, and location in the document. The method offers a first approach to a global, context-free classification of documents.
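The abstract does not name the clustering algorithm or give feature values, so the following is only a minimal sketch of the idea: it assumes plain k-means with three clusters, and the `blocks` data, the normalization to [0, 1], and the choice of block size as the third feature are all hypothetical. The final ranking step also assumes that larger feature values mean a more eye-catching block, which the abstract does not state.

```python
from math import dist

def kmeans(points, k=3, iters=50):
    """Plain Lloyd's k-means (an assumed stand-in, not the paper's stated method).

    Initial centroids are picked evenly from the sorted points so the run is
    deterministic. Requires k >= 2.
    """
    ordered = sorted(points)
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each block to its nearest centroid in the 3D feature space.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical (complexity, structural relief, relative size) per text block.
blocks = [
    (0.90, 0.85, 0.95), (0.88, 0.80, 0.90),                      # heading-like
    (0.55, 0.50, 0.50), (0.50, 0.55, 0.45), (0.52, 0.48, 0.55),  # sub-heading-like
    (0.15, 0.20, 0.10), (0.10, 0.15, 0.12),
    (0.12, 0.18, 0.08), (0.18, 0.12, 0.15),                      # paragraph-like
]

labels, centroids = kmeans(blocks, k=3)

# Assumed ranking: clusters with larger centroid values are explored first.
order = sorted(range(3), key=lambda c: -sum(centroids[c]))
print(labels, order)
```

The cluster indices carry no meaning by themselves; mapping them to main headings, sub-headings, and paragraphs would need a rule such as the ranking above.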
Similar resources
Document clustering based on ontology and a fuzzy approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...
PAYMA: A Tagged Corpus of Persian Named Entities
The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...
An automatic approach for ontology-based feature extraction from heterogeneous textual resources
Data mining algorithms such as data classification or clustering methods exploit features of entities to characterise, group or classify them according to their resemblance. In the past, many feature extraction methods focused on the analysis of numerical or categorical properties. In recent years, motivated by the success of the Information Society and the WWW, which has made available enormou...
Topic Oriented Semi-supervised Document Clustering
In our study on developing a text mining prototype system, documents need to be grouped according to the user's need. However, traditional document clustering is usually treated as unsupervised learning, so it cannot effectively group documents according to the user's need. To solve this problem, we propose a new document clustering approach. The main contributions include: (1) Describes user’s need b...
Heterogeneous Transfer Learning for Image Clustering via the Social Web
In this paper, we present a new learning scenario, heterogeneous transfer learning, which improves learning performance when the data can be in different feature spaces and where no correspondence between data instances in these spaces is provided. In the past, we have classified Chinese text documents using English training data under the heterogeneous transfer learning framework. In this pape...